# Load packages
library(ggplot2)
library(dplyr)
# Load data
library(palmerpenguins)
## Error in library(palmerpenguins): there is no package called 'palmerpenguins'
data(penguins)Multiple linear regression: model building (part 2) (Notes)
STAT 155
Notes
Learning goals
By the end of this lesson, you should be able to:
- Explain when variables are redundant or multicollinear.
- Relate redundancy and multicollinearity to coefficient estimates and \(R^2\).
- Explain why adjusted \(R^2\) is preferable to multiple \(R^2\) when comparing models with different numbers of predictors.
Readings and videos
Today is a day to discover ideas, so no readings or videos to go through before class, but if you want to see today’s ideas presented in a different way, you can take a look at the following after class:
- Reading: Section 3.9.5 in the STAT 155 Notes
- Video: Redundancy and Multicollinearity
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Context: Last time we talked about different types of research questions (descriptive, predictive, and causal) and worked through how to translate these questions to data explorations and modeling. Also recall that when we talked about model evaluation previously, we asked if the model was correct, strong, and fair.
Today we’ll revisit predictive research questions today and some considerations when trying to build strong models. In particular, we’ll consider nuances and limitations to indiscriminately adding more predictors to our model. To explore these ideas, enter install.packages("palmerpenguins") in the Console. Then load the following data on penguins:
You can find a codebook for these data by typing ?penguins in your console (not Rmd). Our goal throughout will be to build a model of bill lengths (in mm):
(Art by @allison_horst)
To get started, the flipper_length_mm variable currently measures flipper length in mm. Let’s create and save a new variable named flipper_length_cm which measures flipper length in cm. NOTE: There are 10mm in a cm.
penguins <- penguins %>%
mutate(flipper_length_cm = flipper_length_mm / 10)
## Error in `mutate()`:
## ℹ In argument: `flipper_length_cm = flipper_length_mm/10`.
## Caused by error:
## ! object 'flipper_length_mm' not foundRun the code chunk below to build a bunch of models that you’ll be exploring in the exercises:
penguin_model_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found
penguin_model_2 <- lm(bill_length_mm ~ flipper_length_cm, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found
penguin_model_3 <- lm(bill_length_mm ~ flipper_length_mm + flipper_length_cm, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found
penguin_model_4 <- lm(bill_length_mm ~ body_mass_g, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found
penguin_model_5 <- lm(bill_length_mm ~ flipper_length_mm + body_mass_g, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not foundExercise 1: Modeling bill length by flipper length
What can a penguin’s flipper (arm) length tell us about their bill length? To answer this question, we’ll consider 3 of our models:
| model | predictors |
|---|---|
penguin_model_1 |
flipper_length_mm |
penguin_model_2 |
flipper_length_cm |
penguin_model_3 |
flipper_length_mm + flipper_length_cm |
Plots of the first two models are below:
ggplot(penguins, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'flipper_length_mm' not found
ggplot(penguins, aes(y = bill_length_mm, x = flipper_length_cm)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'flipper_length_cm' not foundBefore examining the model summaries, check your intuition. Do you think the
penguin_model_2R-squared will be less than, equal to, or more than that ofpenguin_model_1? Similarly, how do you think thepenguin_model_3R-squared will compare to that ofpenguin_model_1?Check your intuition: Examine the R-squared values for the three penguin models and summarize how these compare.
summary(penguin_model_1)$r.squared
## Error: object 'penguin_model_1' not found
summary(penguin_model_2)$r.squared
## Error: object 'penguin_model_2' not found
summary(penguin_model_3)$r.squared
## Error: object 'penguin_model_3' not found- Explain why your observation in part b makes sense. Support your reasoning with a plot of just the 2 predictors:
flipper_length_mmvsflipper_length_cm.
- OPTIONAL challenge: In
summary(penguin_model_3), theflipper_length_cmcoefficient isNA. Explain why this makes sense. HINT: Thinking about what you learned about controlling for covariates, why wouldn’t it make sense to interpret this coefficient? BONUS: For those of you that have taken MATH 236, this has to do with matrices that are not of full rank!
Exercise 2: Incorporating body_mass_g
In this exercise you’ll consider 3 models of bill_length_mm:
| model | predictors |
|---|---|
penguin_model_1 |
flipper_length_mm |
penguin_model_4 |
body_mass_g |
penguin_model_5 |
flipper_length_mm + body_mass_g |
Which is the better predictor of
bill_length_mm:flipper_length_mmorbody_mass_g? Provide some numerical evidence.penguin_model_5incorporates bothflipper_length_mmandbody_mass_gas predictors. Before examining a model summary, ask your gut: Will thepenguin_model_5R-squared be close to 0.35, close to 0.43, or greater than 0.6?Check your intuition. Report the
penguin_model_5R-squared and summarize how this compares to that ofpenguin_model_1andpenguin_model_4.Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 predictors:
flipper_length_mmvsbody_mass_g.
Exercise 3: Redundancy and Multicollinearity
The exercises above have illustrated special phenomena in multivariate modeling:
- two predictors are redundant if they contain the same exact information
- two predictors are multicollinear if they are strongly associated (they contain very similar information) but are not completely redundant.
Recall that we examined 5 models:
| model | predictors |
|---|---|
penguin_model_1 |
flipper_length_mm |
penguin_model_2 |
flipper_length_cm |
penguin_model_3 |
flipper_length_mm + flipper_length_cm |
penguin_model_4 |
body_mass_g |
penguin_model_5 |
flipper_length_mm + body_mass_g |
- Which model had redundant predictors and which predictors were these?
- Which model had multicollinear predictors and which predictors were these?
- In general, what happens to the R-squared value if we add a redundant predictor to a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?
- Similarly, what happens to the R-squared value if we add a multicollinear predictor to a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?
Exercise 4: Considerations for strong models
Let’s dive deeper into important considerations when building a strong model. We’ll use a subset of the penguins data for exploring these ideas.
# For illustration purposes only, take a sample of 10 penguins.
# We'll discuss this code later in the course!
set.seed(155)
penguins_small <- sample_n(penguins, size = 10) %>%
mutate(flipper_length_mm = jitter(flipper_length_mm))
## Error in `mutate()`:
## ℹ In argument: `flipper_length_mm = jitter(flipper_length_mm)`.
## Caused by error:
## ! object 'flipper_length_mm' not foundConsider 3 models of bill length:
# A model with one predictor (flipper_length_mm)
poly_mod_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins_small)
## Error in eval(mf, parent.frame()): object 'penguins_small' not found
# A model with two predictors (flipper_length_mm and flipper_length_mm^2)
poly_mod_2 <- lm(bill_length_mm ~ poly(flipper_length_mm, 2), penguins_small)
## Error in eval(mf, parent.frame()): object 'penguins_small' not found
# A model with nine predictors (flipper_length_mm, flipper_length_mm^2, ... on up to flipper_length_mm^9)
poly_mod_9 <- lm(bill_length_mm ~ poly(flipper_length_mm, 9), penguins_small)
## Error in eval(mf, parent.frame()): object 'penguins_small' not found- Before doing any analysis, which of the three models do you think will be best?
- Calculate the R-squared values of these 3 models. Which model do you think is best?
summary(poly_mod_1)$r.squared
## Error: object 'poly_mod_1' not found
summary(poly_mod_2)$r.squared
## Error: object 'poly_mod_2' not found
summary(poly_mod_9)$r.squared
## Error: object 'poly_mod_9' not found- Check out plots depicting the relationship estimated by these 3 models. Which model do you think is best?
# A plot of model 1
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## Error: object 'penguins_small' not found# A plot of model 2
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
## Error: object 'penguins_small' not found# A plot of model 9
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE)
## Error: object 'penguins_small' not foundExercise 5: Reflecting on these investigations
- List 3 of your favorite foods. Now imagine making a dish that combines all of these foods. Do you think it would taste good?
- Too many good things doesn’t make necessarily make a better thing. Model 9 demonstrates that it’s always possible to get a perfect R-squared of 1, but there are drawbacks to putting more and more predictors into our model. Answer the following about model 9:
- How easy would it be to interpret this model?
- Would you say that this model captures the general trend of the relationship between
bill_length_mmandflipper_length_mm? - How well do you think this model would generalize to penguins that were not included in the
penguins_smallsample? For example, would you expect these new penguins to fall on the wiggly model 9 curve?
Exercise 6: Overfitting
Model 9 provides an example of a model that is overfit to our sample data. That is, it picks up the tiny details of our data at the cost of losing the more general trends of the relationship of interest. Check out the following xkcd comic. Which plot pokes fun at overfitting?
Some other goodies:
Exercise 7: Questioning R-squared
Zooming out, explain some limitations of relying on R-squared to measure the strength / usefulness of a model.
Exercise 8: Adjusted R-squared
We’ve seen that, unless a predictor is redundant with another, R-squared will increase. Even if that predictor is strongly multicollinear with another. Even if that predictor isn’t a good predictor! Thus if we only look at R-squared we might get overly greedy. We can check our greedy impulses a few ways. We take a more in depth approach in STAT 253, but one quick alternative is reported right in our model summary() tables. Adjusted R-squared includes a penalty for incorporating more and more predictors. Mathematically (where \(n\) is the sample size and \(p\) is the number of non-intercept coefficients):
\[ \text{Adjusted } R^2 = 1 - (1 - R^2) \left( \frac{n-1}{n-p-1} \right) \]
Thus unlike R-squared, Adjusted R-squared can decrease when the information that a predictor contributes to a model isn’t enough to offset the complexity it adds to that model. Consider two models:
example_1 <- lm(bill_length_mm ~ species, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found
example_2 <- lm(bill_length_mm ~ species + island, penguins)
## Error in eval(predvars, data, env): object 'bill_length_mm' not found- Check out the summaries for the 2 example models. In general, how does a model’s Adjusted R-squared compare to the R-squared? Is it greater, less than, or equal to the R-squared?
- How did the R-squared change from example model 1 to model 2? How did the Adjusted R-squared change?
- Explain what it is about
islandthat resulted in a decreased Adjusted R-squared. Note: it’s not necessarily the case thatislandis a bad predictor on its own!
Reflection
Today we looked at some cautions surrounding indiscriminately adding variables to a model. Summarize key takeaways.
Response: Put your response here.
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework!